IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 11, No. 1, March 2022, pp. 276~283
ISSN: 2252-8938, DOI: 10.11591/ijai.v11.i1.pp276-283


Model optimisation of class imbalanced learning using ensemble 
classifier on over-sampling data 


Yulia Ery Kurniawati, Yulius Denny Prabowo 


Department of Informatics, Faculty of Computers Science and Design, Institut Teknologi dan Bisnis Kalbis, Jakarta, Indonesia 


Article Info

Article history:

Received May 28, 2021
Revised Dec 23, 2021
Accepted Jan 4, 2022

Keywords:

Adaptive synthetic-nominal
Class imbalance learning
Ensemble classifier
Over-sampling
Synthetic minority oversampling technique-nominal

ABSTRACT

Data imbalance is a common problem in applications of machine learning
and data mining, and it often affects the most essential and needed case
entities. Two approaches to overcoming this problem are the data-level
approach and the algorithm approach. This study aims to obtain the best
model for the pap smear dataset by combining the data-level approach with
the algorithm approach to handle imbalanced data. Laboratory data are
mostly small and imbalanced, and in almost every case the minority entities
are the most important and needed. The data-level approach used in this
study is over-sampling with the synthetic minority oversampling
technique-nominal (SMOTE-N) and adaptive synthetic-nominal (ADASYN-N)
algorithms. The algorithm approach is an ensemble classifier using AdaBoost
and bagging with the classification and regression tree (CART) as the base
learner. In the experimental results, the best model in terms of accuracy,
precision, recall, and f-measure was obtained using ADASYN-N with
AdaBoost-CART.

1. INTRODUCTION

One of the problems in machine learning and data mining is imbalanced data. Imbalance occurs
when the number of examples of each class in the dataset is disproportionate [1], and it usually affects the
most essential and needed entities. The issue becomes more complicated in multiclass problems, where it is
hard to know a priori which of the multi-majority and multi-minority classes should be stressed
during the learning stage. Machine learning algorithms have difficulty classifying minority
classes, the classes with the smallest number of instances, because they assume that the class
distribution is balanced. As a result, the classifier tends to focus on the majority class and ignore the
minority class, which leads to classification errors for the minority classes. Imbalanced data can be found in
many areas, such as medicine [2], [3], abnormal electricity consumption [4], price forecasting [5], credit
evaluation [6], and cyanobacteria bloom [7].

There are three approaches to dealing with class imbalance: the data-level
approach, the algorithmic approach, and hybrid approaches [8], [9]. The data-level approach uses
sampling methods, which are divided into two groups: sampling of the minority
class (over-sampling) [10], [11] and sampling of the majority class (under-sampling) [12], [13].
Meanwhile, the algorithmic approach designs new algorithms or refines existing
ones; one example is the ensemble method. Ensemble methods use a set of classifiers to make a
prediction. The generalisation ability of the ensemble is generally much stronger than that of the individual
ensemble members [8]. There are two categories of ensembles: parallel and sequential. The parallel
ensemble obtains base learners in parallel, for example, bagging [6], [14]. In comparison, the sequential
ensemble produces base learners sequentially, where the previous base learner influences the next generation
of learners, for example, adaptive boosting (AdaBoost).

Kurniawati et al. proposed ADASYN-N in their 2018 study on adaptive synthetic-nominal (ADASYN-N)
and adaptive synthetic-KNN (ADASYN-KNN) for multiclass imbalance learning on laboratory test data [2].
ADASYN-N can handle nominal data types, which the ADASYN proposed by He et al. [11] cannot. That
study used over-sampling to address class imbalance in the pap-smear result dataset [15].
The over-sampling methods used were synthetic minority oversampling technique-nominal (SMOTE-N),
ADASYN-N, and ADASYN-KNN. The result was that ADASYN-N performed better than SMOTE-N on all
performance metrics for NBC.

Fithrasari et al. studied the handling of imbalanced data in classification models with nominal
predictors in 2020 [16]. They used the 2018 Survei
Kinerja dan Akuntabilitas Kependudukan Keluarga Berencana dan Pembangunan Keluarga (SKAP KKBPK)
data for Jawa Timur Province. ADASYN-N, SMOTE-N, and synthetic minority oversampling
technique-nominal edited nearest neighbor (SMOTE-N-ENN) were used to handle the imbalanced
dataset, and the results were then tested using classification and regression trees (CART). ADASYN-N gave
the best average area under the curve (AUC) compared with SMOTE-N and SMOTE-N-ENN.
It could increase accuracy from 0.737 to 0.963.

Rachburee and Punlumjeak, in their work on oversampling techniques for student performance
classification in an engineering course, combined oversampling methods with several classifier models [17].
The oversampling methods used were SMOTE, Borderline-SMOTE, SVMSMOTE, and ADASYN.
The classifier models were MLP, gradient boosting, AdaBoost, and random forest. The result
was that Borderline-SMOTE gave the best result among the methods compared.

No further research has sought the best model for the pap-smear result dataset [15], so
this study combines over-sampling methods, which are a data-level approach, with an algorithm
approach. The algorithm approach uses ensemble classifiers, AdaBoost.M1 and bagging, with CART as the
base learner. This study chose CART because CART and decision trees are unstable learning
algorithms, and the ensemble method can improve the generalisation performance and accuracy of unstable
learning algorithms [18].


2. METHOD 

This study optimises the model for class imbalanced learning using ensemble learning on
over-sampled data. Figure 1 shows the research flow. The imbalanced dataset was the pap
smear results dataset [15]. Both a data-level approach and an algorithm approach were used to optimise
class imbalance handling. The first, data-level approach was over-sampling, using SMOTE-N [10] and
ADASYN-N [2] to handle nominal features. SMOTE-N can solve the overfitting problem of random
over-sampling [10]. ADASYN, proposed by He et al., improves on SMOTE by adaptively generating
synthetic minority samples according to their distribution [11].
However, ADASYN can only handle numerical data, so this study used ADASYN-N [2] because
the dataset is nominal. The second approach was the algorithm approach, an ensemble classifier that
combines CART with AdaBoost and Bagging. Thirty stages of 10-fold cross-validation were used to validate
the model. Each stage divides the dataset into ten folds; one fold serves as test data and the rest as training
data, and the process repeats until every fold has been used for testing. The evaluation metrics were
accuracy, precision, recall, and f-score.
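The validation scheme described above can be sketched as follows. This is a minimal illustration, assuming that each "stage" is an independent reshuffle of the data followed by a plain (unstratified) 10-fold split; the function name `repeated_kfold` and the `seed` parameter are hypothetical and not taken from the study.

```python
import random

def repeated_kfold(n_instances, k=10, repeats=30, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_instances))
        rng.shuffle(idx)                       # new random partition per stage
        folds = [idx[i::k] for i in range(k)]  # k nearly equal folds
        for f, test in enumerate(folds):
            # every fold except the current one becomes training data
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, train, test

# 75 instances, as in the pap smear dataset: 30 x 10 = 300 train/test splits
splits = list(repeated_kfold(75, k=10, repeats=30))
```

Averaging a metric over all 300 splits gives a mean score of the kind reported in the results.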


2.1. Dataset 

The dataset used is the pap smear results dataset collected by Kurniawati et al. [15]. It was chosen
because of the large difference between its minority and majority classes, and because no further research
has tried to improve classifier performance on this dataset using over-sampling and ensemble classifiers. The
dataset contains 38 features, which are microscopic features from the anatomical pathology results of the Pap
smear test, and 75 instances divided into seven classes. Table 1 lists the seven classes in the dataset.

Figure 2 shows the number of instances of each class in the dataset. Chronic
Inflammation had the most instances, with 31, and Ca Cervix Suspect had the fewest, with
two. Thus, the ratio between the largest and the smallest class was 31:2. The impact of this imbalance
is poor performance when classifying the minority classes.

[Figure 1 flowchart: Imbalance Dataset -> Class Imbalance Handling: Data level (SMOTE-N, ADASYN-N) -> SMOTE-N Dataset, ADASYN-N Dataset -> CART, AdaBoost.M1-CART, Bagging-CART]

Figure 1. Research flow


Table 1. Classes
No   Class
1    Chronic Inflammation
2    Cervical Polyp
3    Epidermoid Carcinoma
4    Normal
5    Papillary Adenocarcinoma
6    Squamous Carcinoma
7    Ca Cervix Suspect

[Figure 2 bar chart: number of instances per class; classes shown: Chronic Inflammation, Cervical Polyp, Epidermoid Carcinoma, Normal, Papillary Adenocarcinoma, Squamous Carcinoma, Ca Cervix Suspect]

Figure 2. The instance number of each class


2.2. Class imbalanced handling 

Class imbalance learning (CIL) is learning from data with a class imbalance. A dataset is
class imbalanced if the number of instances of each class is disproportionate [19] or if one
class has many more instances than another [20]. A dataset in which the most common class is less than
twice as frequent as the rarest class is only slightly unbalanced. In contrast, a dataset with an imbalance
ratio of 10:1 is imbalanced, and a dataset with an imbalance ratio of 1000:1 is very unbalanced [8].
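As a small worked example of the imbalance ratio, the sketch below computes it for the two class sizes stated in section 2.1 (31 and 2 instances). The helper name `imbalance_ratio` is hypothetical, and the sizes of the other five classes are not used because they are not stated in the text.

```python
def imbalance_ratio(class_counts):
    """Ratio of the largest class size to the smallest class size."""
    sizes = list(class_counts.values())
    return max(sizes) / min(sizes)

# Only the two class sizes stated in the text are used here.
stated_counts = {"Chronic Inflammation": 31, "Ca Cervix Suspect": 2}
ratio = imbalance_ratio(stated_counts)  # 31 / 2
```

A ratio of 15.5:1 lies above the 10:1 level that the text describes as imbalanced.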

Imbalance degrades learning and predictive ability on rare classes. There are two aspects of the
approach to dealing with imbalanced datasets, namely the data level and the algorithm level [8], [21], [22].
The first approach to overcoming class imbalance is sampling the minority class (over-sampling).
Over-sampling balances the class distribution by randomly replicating instances of the minority classes.
However, over-sampling increases the likelihood of overfitting because it duplicates instances exactly.
In 2002, Chawla et al. [10] proposed SMOTE as a solution to overfitting in over-sampling.
SMOTE makes use of the nearest neighbours and the desired amount of over-sampling. Meanwhile,
under-sampling balances the classes by randomly removing instances from the majority class. However,
under-sampling has a disadvantage, namely the loss of information and data that may be
necessary for the decision-making process of the machine learning model. The second approach is the
algorithm approach, one example of which is the ensemble method. The ensemble method uses a set of
classifiers to make predictions. An ensemble's generalisation ability is generally stronger than that of its
individual members [8]. There are two categories of ensembles, namely parallel ensembles and sequential
ensembles. Parallel ensembles produce base learners in parallel, for example, bagging. Sequential ensembles
produce base learners sequentially, where previous base learners influence subsequent generations of
learners, for instance, AdaBoost. This study uses both the data-level and the algorithm approaches to handle
the class imbalance problem. The data-level approach uses over-sampling with SMOTE-N
and ADASYN-N. The algorithm approach uses ensemble methods.


2.2.1. SMOTE-N 

SMOTE-N is an extension of SMOTE for nominal features, proposed by Chawla et al. [10].
SMOTE-N computes the nearest neighbours using the modified version of the value difference metric
(VDM) proposed by Cost and Salzberg. A new minority class feature vector is generated by taking,
for each feature, the majority vote among the feature vector under consideration and its k nearest
neighbours.
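The two ingredients described above, a VDM-based distance and a per-feature majority vote, can be sketched as follows. This is an illustrative simplification rather than the implementation used in the study: it assumes the VDM exponent k = 1, sums the per-feature differences without feature weights, and uses hypothetical function names.

```python
from collections import Counter, defaultdict

def vdm_tables(X, y):
    """For each feature, map every value to a Counter of class labels."""
    tables = [defaultdict(Counter) for _ in X[0]]
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            tables[j][value][label] += 1
    return tables

def vdm_delta(table, v1, v2, classes, k=1):
    """Value difference: sum over classes of |P(c|v1) - P(c|v2)|^k."""
    c1, c2 = table[v1], table[v2]
    n1, n2 = sum(c1.values()), sum(c2.values())
    return sum(abs(c1[c] / n1 - c2[c] / n2) ** k for c in classes)

def vdm_distance(tables, a, b, classes):
    """Distance between two nominal vectors (unweighted sum, an assumption)."""
    return sum(vdm_delta(t, va, vb, classes) for t, va, vb in zip(tables, a, b))

def majority_vote_sample(seed_vector, neighbours):
    """Synthetic vector: per-feature majority among the seed and its neighbours."""
    rows = [seed_vector] + list(neighbours)
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# Toy data: feature 0 separates the classes, feature 1 carries no information.
X = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
y = [0, 0, 1, 1]
tables = vdm_tables(X, y)
dist = vdm_distance(tables, ("a", "x"), ("b", "y"), classes=[0, 1])
new_vector = majority_vote_sample(("a", "x"), [("a", "y"), ("b", "x")])
```

In the toy data only the first feature contributes to the distance, since both values of the second feature have identical class distributions.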


2.2.2. ADASYN-N 

ADASYN is an over-sampling method for learning from unbalanced datasets, proposed
by He et al. [11]. The main idea of ADASYN is to weight the minority class examples according
to their level of learning difficulty. More synthetic data are generated for minority examples that are
difficult to learn than for minority examples that are easier to learn. ADASYN enhances learning in two
ways. First, it reduces the bias caused by class imbalance; second, it adaptively shifts the classification
decision boundary towards the difficult examples. ADASYN-N is an extension of ADASYN for nominal
data, developed by Kurniawati et al. [2]. The nearest neighbours in ADASYN-N are
calculated using a modified version of VDM, as in SMOTE-N.
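The adaptive weighting can be illustrated with the allocation step of the original ADASYN formulation by He et al.: the degree of imbalance d = m_s / m_l is compared with a threshold d_th, the total number of synthetic examples is G = (m_l - m_s) * beta, and each minority example receives a share proportional to the fraction of majority examples among its K nearest neighbours. The sketch below assumes that ADASYN-N reuses this allocation with VDM-based neighbours; the function and argument names are hypothetical.

```python
def adasyn_allocation(majority_neighbours, n_major, n_minor,
                      k=5, beta=1.0, d_th=0.75):
    """Number of synthetic examples to generate for each minority example.

    majority_neighbours[i] is the count of majority-class examples among the
    k nearest neighbours of minority example i (its learning difficulty).
    """
    d = n_minor / n_major                 # degree of class imbalance
    if d >= d_th:                         # balanced enough: generate nothing
        return [0] * len(majority_neighbours)
    G = (n_major - n_minor) * beta        # total synthetic examples needed
    r = [delta / k for delta in majority_neighbours]
    total = sum(r)
    if total == 0:                        # no minority example is hard to learn
        return [0] * len(r)
    return [round(G * ri / total) for ri in r]

# Two minority examples: the first is surrounded by majority neighbours,
# the second by none, so all synthetic examples go to the first.
allocation = adasyn_allocation([5, 0], n_major=31, n_minor=2)
```

With beta = 1 the minority class is brought up to the size of the majority class, which matches the "level balance" role described for this parameter in the results section.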


2.2.3. Ensemble methods 

The ensemble method trains several base learners on the training data and then
combines their predictions to make the final decision. In contrast, a standard machine learning method
produces only one learner [8]. An ensemble can boost learners that perform better than random guessing
into a learner with strong generalisation ability, and ensembles have been very successful in many machine
learning challenges and real-life applications [23].

The base learner is often referred to as the weak learner, which indicates that the base learner may have
weak generalisation ability on its own. However, most learning algorithms, such as decision trees,
neural networks, or other machine learning methods, can be used to train the base learner, and ensemble
methods can improve their performance [8].

Ensemble methods can be categorised as parallel or sequential based on how the base
learners are generated. The parallel type produces the base learners in parallel, for example, Bagging. The
sequential type produces the base learners sequentially, where each base learner influences the next
generation, for instance, AdaBoost.

Bagging is a method for generating multiple versions of a predictor and using these to get a
combined predictor [24]. Bagging is a representative parallel ensemble method. Bagging should be used
with an unstable learner, for example, a decision tree, because the more unstable the base learner is, the
greater the improvement from bagging. Moreover, bagging has been found to produce a combined model
that performs better than a single model built from the original training data and is never substantially
worse [25]. The Bagging pseudocode is as follows [26]:


Algorithm: Bagging for classification

Input: S: training set; T: number of iterations; n: bootstrap sample size; I: weak learner
Output: Bagged classifier H(x) = sign(\sum_{t=1}^{T} h_t(x)), where h_t \in [-1, 1] is the induced classifier

for t = 1 to T do
    S_t <- RandomSampleReplacement(n, S)
    h_t <- I(S_t)
end for
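The bagging procedure can be sketched in Python as follows. Because the dataset in this study has more than two classes, the sketch combines the learners by majority vote rather than by the sign of a sum; that substitution, the toy 1-nearest-neighbour weak learner, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

def bagging_fit(S, T, n, learner, seed=0):
    """Train T classifiers, each on a bootstrap sample of size n drawn from S."""
    rng = random.Random(seed)
    models = []
    for _ in range(T):
        S_t = [rng.choice(S) for _ in range(n)]  # sampling with replacement
        models.append(learner(S_t))
    return models

def bagging_predict(models, x):
    """Majority vote over the induced classifiers (multiclass variant)."""
    votes = Counter(h(x) for h in models)
    return votes.most_common(1)[0][0]

def nearest_neighbour_learner(sample):
    """Toy weak learner on 1-D inputs: label of the closest training point."""
    return lambda x: min(sample, key=lambda pair: abs(pair[0] - x))[1]

# Two well-separated classes on a line
S = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (1.0, "b"), (1.1, "b"), (1.2, "b")]
models = bagging_fit(S, T=25, n=len(S), learner=nearest_neighbour_learner)
label = bagging_predict(models, 0.05)
```

Each bootstrap sample omits some training points, so the individual learners differ; the vote averages out those differences.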


AdaBoost is a machine learning technique for improving the performance of a weak learner. The
method calls the weak learner iteratively, each time on training data taken from a subset of the entire
dataset. A single strong classifier is then constructed by combining the resulting weak learners trained on
the resampled training sets [27]. There are many boosting variations, one of which is AdaBoost.M1,
designed specifically for classification. AdaBoost.M1 is a simple generalisation of AdaBoost to more than two
classes (multiclass) [1], [27]; it uses the same algorithm as AdaBoost but with a multiclass base learner
instead of a binary one. The AdaBoost.M1 pseudocode is as follows [26]:


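Algorithm: AdaBoost.M1

Input: Training set S = {x_i, y_i}, i = 1, ..., N; and y_i \in C, C = {c_1, ..., c_m}; T: number of iterations; I: weak learner
Output: Boosted classifier
    H(x) = \arg\max_{y \in C} \sum_{t=1}^{T} \ln(1 / \beta_t) [h_t(x) = y],
    where h_t and \beta_t (with h_t(x) \in C) are the induced classifiers and the weights given to each of them

D_1(i) <- 1/N for i = 1, ..., N
for t = 1 to T do
    h_t <- I(S, D_t)
    \epsilon_t <- \sum_{i=1}^{N} D_t(i) [h_t(x_i) \neq y_i]
    if \epsilon_t > 0.5 then
        T <- t - 1
        return
    end if
    \beta_t = \epsilon_t / (1 - \epsilon_t)
    D_{t+1}(i) = D_t(i) \cdot \beta_t^{1 - [h_t(x_i) \neq y_i]} for i = 1, ..., N
    Normalise D_{t+1} to be a proper distribution
end for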
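A compact Python sketch of the AdaBoost.M1 loop is given below. It follows the weight update and voting rule of the pseudocode, but the weak learner is a weighted decision stump on a single numeric feature rather than CART, and the handling of a zero training error (stop and give that learner a very large vote) is an added assumption; all names are hypothetical.

```python
import math
from collections import defaultdict

def weighted_stump(S, D):
    """Weak learner: best single-threshold rule on a 1-D feature under weights D."""
    labels = sorted({y for _, y in S})
    best = None
    for thr in sorted({x for x, _ in S}):
        sides = {}
        for side in (True, False):          # True: x <= thr, False: x > thr
            w = defaultdict(float)
            for (x, y), d in zip(S, D):
                if (x <= thr) == side:
                    w[y] += d
            sides[side] = max(w, key=w.get) if w else labels[0]
        err = sum(d for (x, y), d in zip(S, D) if sides[x <= thr] != y)
        if best is None or err < best[0]:
            best = (err, thr, sides[True], sides[False])
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def adaboost_m1(S, T, learner):
    """Return a list of (classifier, beta) pairs."""
    N = len(S)
    D = [1.0 / N] * N
    ensemble = []
    for _ in range(T):
        h = learner(S, D)
        miss = [h(x) != y for x, y in S]
        eps = sum(d for d, m in zip(D, miss) if m)
        if eps > 0.5:                       # weak learner too weak: stop
            break
        if eps == 0:                        # assumption: perfect learner ends boosting
            ensemble.append((h, 1e-10))
            break
        beta = eps / (1 - eps)
        # correctly classified instances are down-weighted by beta
        D = [d * (1.0 if m else beta) for d, m in zip(D, miss)]
        z = sum(D)
        D = [d / z for d in D]              # normalise to a distribution
        ensemble.append((h, beta))
    return ensemble

def adaboost_predict(ensemble, x):
    """Class with the largest total vote ln(1/beta) among the classifiers."""
    votes = defaultdict(float)
    for h, beta in ensemble:
        votes[h(x)] += math.log(1 / beta)
    return max(votes, key=votes.get)

# A 1-D problem that no single stump classifies perfectly
S = [(0, "a"), (1, "a"), (2, "b"), (3, "a"), (4, "b"), (5, "b")]
ensemble = adaboost_m1(S, T=3, learner=weighted_stump)
predictions = [adaboost_predict(ensemble, x) for x, _ in S]
```

On this toy set the first stump misclassifies one point, whose weight then grows so that later stumps focus on it; after three rounds the weighted vote fits all six training points.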


2.2.4. CART 

The classification and regression tree (CART) algorithm is a tree-building
method that produces a classification tree if the target attribute is categorical and a regression tree
if the target attribute is continuous [28]. Using a binary splitting procedure, CART selects the attributes,
and the interactions between them, that are most dominant in determining the outcome. In
choosing the best splitter, CART strives to maximise the average purity of the two child nodes. The purity
measure can be chosen freely and is referred to as the splitting criterion or splitting function. The
most common splitting function is the Gini index, calculated as shown in (1):


Gini(t) = 1 - \sum_{i=1}^{c} [p(i|t)]^2 (1)


where p(i|t) is the relative frequency of class i at node t, and c is the number of classes. The
Gini index is highest when the classes at the node are uniformly distributed and lowest (zero) when all
records at the node belong to the same class.
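Equation (1) and the notion of average child purity can be checked numerically with the short sketch below. The size-weighted average used in `split_gini` is the usual way of scoring a binary split and is stated here as an assumption, as are the function names.

```python
def gini(class_counts):
    """Gini index of a node from its per-class instance counts, as in (1)."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def split_gini(left_counts, right_counts):
    """Size-weighted average Gini index of the two child nodes of a split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return n_left / n * gini(left_counts) + n_right / n * gini(right_counts)

uniform_two_class = gini([5, 5])      # most impure two-class node
pure_node = gini([10, 0])             # all records in one class
perfect_split = split_gini([5, 0], [0, 5])
```

A candidate split with a lower `split_gini` value yields purer children, so a CART-style splitter would prefer it.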


2.2.5. Evaluation 

The confusion matrix is used to calculate the evaluation metrics: accuracy, recall, precision,
f-score, and receiver operating characteristic (ROC) area. Table 2 shows the confusion
matrix. TP is the condition where the classifier correctly classifies a positive result, and TN is the
condition where the classifier correctly classifies a negative result. Meanwhile, FP is the condition in which
the classifier identifies a negative result as positive, and FN is the condition where the classifier identifies a
positive result as negative.


Table 2. Confusion matrix
                  Predicted class
Actual class      Positive               Negative
Positive          True positive (TP)     False negative (FN)
Negative          False positive (FP)    True negative (TN)


Accuracy, precision, recall, and f-score are calculated using (2)-(5):


Accuracy = \frac{TP + TN}{TP + FP + TN + FN} (2)

Precision = \frac{TP}{TP + FP} (3)

Recall = \frac{TP}{TP + FN} (4)

f-score = (1 + \beta^2) \frac{precision \times recall}{(\beta^2 \times precision) + recall} (5)
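Equations (2)-(5) translate directly into code. The sketch below works on the four counts of a binary confusion matrix; how the per-class values are averaged over the seven classes in this study is not stated, so no averaging is shown. The function names and the example counts are hypothetical.

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_score(tp, fp, fn, beta=1.0):
    """F-measure as in (5); beta = 1 weighs precision and recall equally."""
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Hypothetical counts: 8 TP, 2 FP, 8 TN, 2 FN
acc = accuracy(8, 2, 8, 2)
f1 = f_score(8, 2, 2)
```

With these counts precision and recall are both 0.8, so the f-score with beta = 1 is also 0.8.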

The ROC area, commonly known as the AUC, is a standard technique for classifier evaluation. The larger
the AUC, the better the model. The AUC can be interpreted as the probability that the classifier ranks a
randomly selected positive instance above a randomly selected negative instance [26]. Each curve in a
ROC plot represents the performance of a different classifier on the dataset. The X-axis represents the false
positive rate (FPR), and the Y-axis represents the true positive rate (TPR). FPR and TPR are calculated
using (6) and (7):


FPR = \frac{FP}{FP + TN} (6)

TPR = \frac{TP}{TP + FN} (7)
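The ranking interpretation of the AUC can be made concrete with a small sketch that compares every positive score with every negative score. Counting ties as one half is a common convention assumed here; the function names and the example scores are hypothetical.

```python
def rates(tp, fp, tn, fn):
    """Return (FPR, TPR) from confusion-matrix counts, as in (6) and (7)."""
    return fp / (fp + tn), tp / (tp + fn)

def auc_by_ranking(pos_scores, neg_scores):
    """Fraction of positive/negative pairs in which the positive scores higher."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

fpr, tpr = rates(tp=8, fp=2, tn=8, fn=2)
auc = auc_by_ranking([0.9, 0.4], [0.5, 0.1])  # 3 of 4 pairs ranked correctly
```

A classifier that ranks every positive above every negative reaches an AUC of 1.0, while random scoring gives about 0.5.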


3. RESULTS AND DISCUSSION 

New datasets were generated using SMOTE-N and ADASYN-N and then tested using CART,
AdaBoost-CART, and Bagging-CART. The classifiers were validated using 30 stages of 10-fold
cross-validation. The evaluation metrics used were accuracy, recall, precision, f-score, and ROC area.
Table 3 shows the algorithm configuration.


Table 3. Algorithm configuration
Algorithm    Variable    Value
SMOTE-N      KNN         5
             %N          Adjusted to the number of instances
ADASYN-N     KNN         5
             d_th        0.75
             \beta       1


Both SMOTE-N and ADASYN-N used the five nearest neighbours. ADASYN-N used 0.75 for d_th,
the maximum tolerated level of the class imbalance ratio, and 1 for \beta, the balance level. For SMOTE-N,
the value of %N was adjusted to the number of instances generated by ADASYN-N. Figure 3 shows
the performance of each optimisation on both datasets with all the classifiers used in this study.
Optimisation using the data-level approach alone, or the data-level and algorithm approaches together,
could improve the accuracy from 89.34% to 96.39%. The imbalanced dataset with CART had the lowest
mean accuracy, 77.87%. The best accuracy, 96.39%, was obtained by ADASYN-N with AdaBoost-CART
as the classifier. The mean accuracies on the SMOTE-N and ADASYN-N datasets are better than those on
the imbalanced dataset.


[Figure 3 bar chart: mean accuracy per model. Imbalance Dataset: CART 77.87%, Bagging-CART 79.82%, AdaBoost.M1-CART 79.70%; SMOTE-N: CART 89.34%, Bagging-CART 90.15%, AdaBoost.M1-CART 87.73%; ADASYN-N: CART 94.67%, Bagging-CART 94.96%, AdaBoost.M1-CART 96.39%]

Figure 3. Mean accuracy

Table 4 shows the other performance metrics used in this study: precision, recall, f-measure,
and ROC area. ADASYN-N with AdaBoost-CART had the best precision, recall,
and f-measure. However, the best ROC area was obtained by ADASYN-N with Bagging-CART. The strong
result for boosting is reasonable because boosting builds its models iteratively, so each new model is
affected by the performance of the previous one. Boosting encourages new models to become experts on
the instances that the previous model handled incorrectly by assigning greater weight to those instances.


Table 4. Evaluation
Model                                  Precision   Recall    F-Measure   ROC Area
Imbalance Dataset - CART               67.16%      77.87%    71.88%      84.25%
Imbalance Dataset - Bagging-CART       72.29%      79.82%    75.45%      90.50%
Imbalance Dataset - AdaBoost.M1-CART   80.29%      79.70%    79.47%      91.67%
SMOTE-N - CART                         90.05%      89.34%    89.19%      95.11%
SMOTE-N - Bagging-CART                 90.71%      90.15%    90.01%      95.74%
SMOTE-N - AdaBoost.M1-CART             87.82%      87.73%    87.67%      95.83%
ADASYN-N - CART                        95.13%      94.67%    94.42%      98.29%
ADASYN-N - Bagging-CART                95.45%      94.96%    94.72%      99.22%
ADASYN-N - AdaBoost.M1-CART            96.59%      96.39%    96.27%      98.56%
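The comparison in Table 4 can be reproduced mechanically. The sketch below copies the reported percentages into a dictionary and picks the best model for each metric; the variable and function names are hypothetical, and no values beyond those in the table are used.

```python
# Values copied from Table 4 (percentages):
# (precision, recall, f-measure, ROC area)
TABLE_4 = {
    "Imbalance Dataset - CART": (67.16, 77.87, 71.88, 84.25),
    "Imbalance Dataset - Bagging-CART": (72.29, 79.82, 75.45, 90.50),
    "Imbalance Dataset - AdaBoost.M1-CART": (80.29, 79.70, 79.47, 91.67),
    "SMOTE-N - CART": (90.05, 89.34, 89.19, 95.11),
    "SMOTE-N - Bagging-CART": (90.71, 90.15, 90.01, 95.74),
    "SMOTE-N - AdaBoost.M1-CART": (87.82, 87.73, 87.67, 95.83),
    "ADASYN-N - CART": (95.13, 94.67, 94.42, 98.29),
    "ADASYN-N - Bagging-CART": (95.45, 94.96, 94.72, 99.22),
    "ADASYN-N - AdaBoost.M1-CART": (96.59, 96.39, 96.27, 98.56),
}
METRICS = ("precision", "recall", "f_measure", "roc_area")

def best_model_per_metric(table):
    """Map each metric name to the model with the highest reported value."""
    return {
        metric: max(table, key=lambda model: table[model][i])
        for i, metric in enumerate(METRICS)
    }

best = best_model_per_metric(TABLE_4)
```

The result agrees with the discussion: ADASYN-N with AdaBoost.M1-CART leads on precision, recall, and f-measure, while ADASYN-N with Bagging-CART leads on ROC area.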


4. CONCLUSION 

Laboratory test data, such as the pap smear results dataset, are mostly small and imbalanced.
In almost every case, the minority entities are the most important and needed. Based on this study's results,
optimising class imbalanced learning with both the data-level and algorithm approaches, that is, using an
ensemble classifier on over-sampled data, could increase all the evaluation metrics on laboratory test data.
ADASYN-N is better than SMOTE-N for over-sampling the dataset used in this study. The best model was
obtained using ADASYN-N with AdaBoost-CART. Future work will use base learners other than
CART to find the best model for imbalanced laboratory test data.